Abstract

This paper presents a general framework for exploiting the representational capacity of neural networks to approximate complex, nonlinear reward functions in the context of solving the inverse reinforcement learning (IRL) problem. We show in this context that the Maximum Entropy paradigm for IRL lends itself naturally to the efficient training of deep architectures. At test time, the approach leads to a computational complexity independent of the number of demonstrations, which makes it especially well-suited for applications in life-long learning scenarios. Our approach achieves performance commensurate to the state-of-the-art on existing benchmarks while exceeding it on an alternative benchmark based on highly varying reward structures. Finally, we extend the basic architecture - which is equivalent to a simplified subclass of Fully Convolutional Neural Networks (FCNNs) with width one - to include larger convolutions in order to eliminate dependency on precomputed spatial features and work on raw input representations.


1. Introduction 

Recent successes in machine learning, vision and robotics have led to widespread expectations that machines will increasingly succeed in applications of real value to the public domain. A central tenet of any vision delivering on this promise revolves around learning from user interactions. Inverse reinforcement learning (IRL) is playing a pivotal role in these developments and commonly finds applications in robotics (Argall et al., 2009), where it allows robots to learn complex behaviour from human demonstrations, and also in fields of cognition (Baker et al., 2009) and preference learning (Ziebart et al., 2008), where it serves as a tool to better understand human decisions, or medicine (Asoh et al., 2013), to predict patient response to treatment. The objective of IRL is to infer the underlying reward structure guiding an agent's behaviour based on observations as well as a model of the environment. This may be done either to learn the reward structure for modelling purposes or to provide a method to allow the agent to imitate a demonstrator's specific behaviour (Ramachandran & Amir, 2007). While for small problems the complete set of rewards can be learned explicitly, many problems of realistic size require the application of generalisable function approximations.

Figure 1: Fully Convolutional Neural Network for reward approximation in the IRL setting. The network serves to model the relationship between input features and final reward map.


Much of the prior art in this domain relies on parametrisation of the reward function based on pre-determined features. In addition to better generalisation performance than direct state-to-reward mapping, this approach enables the transfer of learned reward functions between different scenarios with the same feature representation. A number of early works (Ziebart et al., 2008; Abbeel & Ng, 2004; Lopes et al., 2009; Ratliff et al., 2006) express the reward function as a weighted linear combination of hand-selected features. To overcome the inherent limitations of linear models, (Choi & Kim, 2013) and (Levine et al., 2010) extend this approach to a limited set of nonlinear rewards by learning a set of composites of logical conjunctions of atomic features. Non-parametric methods such as Gaussian Processes (GPs) have also been employed to cater for potentially complex, nonlinear reward functions (Levine et al., 2011). While in principle this extends the IRL paradigm to the flexibility of nonlinear reward approximation, the use of a kernel machine makes this approach scale badly with larger amounts of training data and prone to requiring a large number of reward samples in order to approximate highly varying reward functions (Bengio et al., 2007). Even sparse GP approximations as used in (Levine et al., 2011) lead to a query-time complexity that depends on the size of the active set or the number of experienced state-reward pairs. Situations with increasingly complex reward functions, which require a larger number of inducing points, can quickly render this nonparametric approach computationally impracticable. Furthermore, in comparison to (Babes et al., 2011), we focus on a single expert, which finally leads to an end-to-end learning scenario in Section 4.3 from raw input to reward without compression or preprocessing of the input representation. To our knowledge the only other work considering the use of deep networks is given by (Levine et al., 2013), who focus on directly approximating policies with neural networks but briefly refer to the possibility of an extension to cost function learning with neural networks.

In contrast to prior art, we explore the use of neural networks to approximate the reward function. Neural networks (NNs) already achieve state-of-the-art performance across a variety of domains such as computer vision, natural language processing, speech recognition (Bengio et al., 2012) and reinforcement learning (Mnih et al., 2013). Their application in IRL suggests itself due to their compact representation of highly nonlinear functions through the composition and reuse of the results of many nonlinearities in the layered structure (Bengio et al., 2007). In addition, NNs provide favourable computational complexity, $O(1)$, at query time with respect to observed demonstrations, which allows scaling to problems with large state spaces and complex reward structures - circumstances which might render the application of existing methods intractable or ineffective. With the approach represented in Figure 1, a state's reward can be determined either solely based on its own feature representation or, by using wider convolutional layers, analysed in combination with its spatial context. The applied architectures are Fully Convolutional Neural Networks, which, by skipping the fully connected final layers common in classification tasks, preserve spatial information and can create an output of the same spatial dimension and size as the input. Recent examples for the application of FCNNs focus on dense prediction, including pixel-wise semantic segmentation (Long et al., 2014), sliding window detection and prediction of object boundaries (Sermanet et al., 2013), depth estimation from single monocular images (Liu et al., 2015) and human pose estimation in monocular images (Tompson et al., 2014).


Our principal contribution is a framework for Maximum Entropy Deep Inverse Reinforcement Learning (DeepIRL) based on the Maximum Entropy paradigm for IRL (Ziebart et al., 2008), which lends itself naturally to training deep architectures by leading to an objective that is, without approximations, fully differentiable with respect to the network weights. Furthermore, we demonstrate performance commensurate to state-of-the-art methods on a publicly available benchmark, while outperforming the state-of-the-art on a new benchmark where the true underlying reward has a complex interacting structure over the feature representation. In addition, we emphasise the flexibility of the approach and eliminate the requirement of preprocessing and precomputed features by applying wider convolutional layers to learn spatial features of relevance to the IRL task. This enables application without manually crafted feature design as long as the state space is constrained to a regularly gridded representation allowing for convolutions.

We argue that these properties are important for practical large-scale applications of IRL, such as life-long learning approaches, where reward functions are often complex and the growing number of demonstrations requires high-capacity models and fast computation.


2. Inverse Reinforcement Learning 

This section presents a brief overview of IRL. Let a Markov Decision Process (MDP) be defined as $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, r\}$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ denotes the set of possible actions, $\mathcal{T}$ denotes the transition model and $r$ denotes the reward structure. Given an MDP, an optimal policy $\pi^*$ is one which, when adhered to, maximizes the expected cumulative reward. Furthermore, an additional factor $\gamma \in [0, 1]$ may be considered in order to discount future rewards.

IRL considers the case where an MDP specification is available but the reward structure is unknown. Instead, a set of expert demonstrations $\mathcal{D} = \{\varsigma_1, \varsigma_2, \ldots, \varsigma_N\}$ is provided, which are sampled from a user policy $\pi$, i.e. provided by a demonstrator. Each demonstration consists of a set of state-action pairs such that $\varsigma_i = \{(s_0, a_0), (s_1, a_1), \ldots, (s_K, a_K)\}$. The goal of IRL is to uncover the hidden reward $r$ from the demonstrations.

A number of approaches have been proposed to tackle the IRL problem (see, for example, (Abbeel & Ng, 2004), (Neu & Szepesvari, 2012), (Ratliff et al., 2006), (Syed & Schapire, 2007)). An increasingly popular formulation is Maximum Entropy IRL (Ziebart et al., 2008), which was used to effectively model large-scale user driving behaviour. In this formulation the probability of user preference for any given trajectory between specified start and goal states is proportional to the exponential of the reward along the path

$$P(\varsigma \mid r) \propto \exp\Big\{ \sum_{s,a \in \varsigma} r_{s,a} \Big\}. \qquad (1)$$

As shown in Ziebart's work, principal benefits of the Maximum Entropy paradigm include the ability to handle expert suboptimality as well as stochasticity by operating on the distribution over possible trajectories. Moreover, the Maximum Entropy based objective function given in Equation 8 enables backpropagation of the objective gradients to the network's weights. The training procedure is then straightforwardly framed as an optimisation task computable e.g. via conjugate gradient or stochastic gradient descent.
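To make Equation 1 concrete, the following is a minimal sketch that normalises the exponentiated path rewards over a small, explicitly enumerated set of candidate trajectories. The function name, the three example paths and their reward values are illustrative assumptions and not part of the original work; in realistic problems the set of trajectories is far too large to enumerate.

```python
import math

def maxent_trajectory_probs(trajectory_rewards):
    """Return P(trajectory | r) proportional to exp(sum of rewards along the path).

    `trajectory_rewards` is a list of reward sequences, one per candidate
    trajectory (hypothetical toy input; enumeration is only feasible here
    because the candidate set is tiny).
    """
    totals = [sum(rewards) for rewards in trajectory_rewards]
    shift = max(totals)  # subtract the maximum for numerical stability
    weights = [math.exp(t - shift) for t in totals]
    normaliser = sum(weights)
    return [w / normaliser for w in weights]

# Three hypothetical paths between the same start and goal states.
paths = [[-1.0, -1.0], [-1.0, -2.0], [-3.0, -3.0]]
probs = maxent_trajectory_probs(paths)
```

Lower-reward paths keep a non-zero probability, which is how the formulation accommodates suboptimal demonstrations.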


2.1. Approximating the Reward Structure 

Due to the dimensionality and size of the state space in many real world applications, the reward structure cannot be observed explicitly for every state. In these cases state rewards are not modelled directly per state, but the reward structure is restricted by imposing that states with similar features, x, should have similar rewards. To this end, function approximation is used in order to regress the feature representation onto a real valued reward using a mapping $g : \mathbb{R}^N \to \mathbb{R}$, with $N$ being the dimensionality of the feature space, such that

$$r = g(f, \theta). \qquad (2)$$

A feature representation, $f$, is usually hand-crafted based on preprocessing such as segmentation and manually defined distance metrics, but can be learned based on the proposed framework, as shown in Section 4.3. Furthermore, the application of feature based function approximation enables easier generalisation and transfer of models.


The choice of model used for function approximation has a dramatic impact on the ability of the algorithm to capture the relationship between the state feature vector $f$ and user preference. Commonly, the mapping from state to reward is simply a weighted linear combination of feature values

$$g(f, \theta) = \theta^\top f. \qquad (3)$$

This choice, while appropriate in some scenarios, is suboptimal if the true reward cannot be accurately approximated by a linear model. In order to alleviate this limitation, (Choi & Kim, 2013) extend the linear model by introducing a mapping $\Phi : \mathbb{R}^N \to \{0, 1\}^N$ such that

$$g(f, \theta) = \theta^\top \Phi(f). \qquad (4)$$

Here $\Phi$ denotes a set of composite features which are jointly learned as part of the objective function. These composites are assumed to be the logical conjunctions of the predefined, atomic features $f$. Due to the nature of the features used, the representational power of this approach is limited to the family of piecewise constant functions.

In contrast, (Levine et al., 2011) employ a Gaussian Process (GP) framework to capture the potentially unbounded complexity of any underlying reward structure. The set of expert demonstrations $\mathcal{D}$ is used in this context to identify an active set of GP support points, $X_u$, and associated rewards $u$. The mean function is then used to represent the individual reward at a state described by $f$

$$g(f, \theta, X_u, u) = K_{f,u}^\top K_{u,u}^{-1} u. \qquad (5)$$

Here $K_{f,u}$ denotes the covariance of the reward at $f$ with the active set reward values $u$ located at $X_u$, and $K_{u,u}$ denotes the covariance matrix of the rewards in the active set computed via a covariance function $k_\theta(f_i, f_j)$ with hyperparameters $\theta$.

Nevertheless, a significant drawback of the GPIRL approach is a computational complexity proportional to the number of demonstrations and the size of the active set of inducing points, which in turn depends on the reward complexity. While the modelling of complex, nonlinear reward structures in problems with large state spaces is theoretically feasible for the GPIRL approach, the cardinality of the active set will quickly become unwieldy, putting GPIRL at a significant computational disadvantage or, worse, rendering it entirely intractable. These shortcomings are remedied when using deep parametric architectures for reward function approximation while keeping the accuracy of nonlinear function approximation, as outlined in the next section.


3. Reward Function Approximation with Deep Architectures

We argue that IRL algorithms scalable to MDPs with large feature spaces require models which are able to efficiently represent complex, nonlinear reward structures. In this context, deep architectures are a natural choice as they explicitly exploit the depth-breadth trade-off (Bengio et al., 2007) and increase representational capacity by reusing the computations of earlier nodes in the following layers.


For the remainder of the paper, we consider a network architecture which accepts as input state features x, maps these to state reward $r$ and is governed by the network parameters $\theta_{1,2,\ldots,n}$. In the context of Section 2.1 the state reward is therefore obtained as

$$r \approx g(f, \theta_1, \theta_2, \ldots, \theta_n) \qquad (6)$$

$$= g_1(g_2(\ldots(g_n(f, \theta_n), \ldots), \theta_2), \theta_1). \qquad (7)$$

Figure 2: Schema for Neural Network based reward function approximation based on the feature representation of MDP states.

While many choices exist for the individual building blocks of a deep architecture, it has been shown that a sufficiently large NN with as little as two layers and sigmoid activation functions can represent any binary function (Hassoun, 1995) or any piecewise-linear function (Hornik et al., 1989) and can therefore be regarded as a universal approximator. While this holds true in theory, it can be far more computationally practicable to extend the depth of the network structure and reduce the number of required computations in doing so (Bengio, 2009).

Importantly, in applying backpropagation, NNs also lend themselves naturally to training in the Maximum Entropy IRL framework, and the network structure can be adapted to suit individual tasks without complicating or even invalidating the main IRL learning mechanism. In the DeepIRL framework proposed here the full range of architecture choices thus becomes available. Different problem domains can utilise different network architectures; for example, convolutional layers can remove the dependency on hand-crafted spatial features. Furthermore, it is straightforward to show that the linear Maximum Entropy IRL approach proposed in (Ziebart et al., 2008) can be seen as a simplification of the more general deep approach and can be recovered by applying the rules of backpropagation to a network with a single linear output connected to all inputs with zero bias term.

While the common NN architectures for whole-image classification regress to fixed size outputs, the applied FCNNs result in an output with equivalent spatial dimensionality, and by padding data correspondingly we realise reward maps of the same size as our input. Note that padding is not the only possibility: deconvolutions as applied by (Long et al., 2014) can also transform and reshape to create equally sized model output.

3.1. Training Procedure

The task of solving the IRL problem can be framed in the context of Bayesian inference as MAP estimation, maximizing the joint posterior distribution of observing expert demonstrations, $\mathcal{D}$, under a given reward structure and of the model parameters $\theta$.

$$\mathcal{L}(\theta) = \log P(\mathcal{D}, \theta \mid r) = \underbrace{\log P(\mathcal{D} \mid r)}_{\mathcal{L}_D} + \underbrace{\log P(\theta)}_{\mathcal{L}_\theta}. \qquad (8)$$

This joint log likelihood is differentiable with respect to the parameters $\theta$ of a linear reward model, which allows the application of gradient descent methods (Snyman, 2005). We extend this benefit to neural networks by separating the gradient of the data term $\mathcal{L}_D = \log P(\mathcal{D} \mid r)$ of Equation 8 into the gradient of the loss with respect to the rewards $r$ and the gradient of the rewards with respect to the network's weights, obtained via backpropagation.


The complete gradient is given by the sum of the gradients with respect to $\theta$ of the data term $\mathcal{L}_D$ and a weight decay term $\mathcal{L}_\theta$ acting as model regulariser

$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}_D}{\partial \theta} + \frac{\partial \mathcal{L}_\theta}{\partial \theta}. \qquad (9)$$

The earlier mentioned separation of derivatives in the gradient of the data term is shown in Equations 10 and 11

$$\frac{\partial \mathcal{L}_D}{\partial \theta} = \frac{\partial \mathcal{L}_D}{\partial r} \cdot \frac{\partial r}{\partial \theta} \qquad (10)$$

$$= (\mu_D - \mathbb{E}[\mu]) \cdot \frac{\partial}{\partial \theta} g(f, \theta), \qquad (11)$$

where $r = g(f, \theta)$. As shown in (Ziebart et al., 2008), the gradient of the expert demonstration term $\mathcal{L}_D$ with respect to the model parameters of a linear function is equal to the difference in feature counts along the trajectories. For higher level models this gradient can be split into the derivative with respect to the reward $r$ and the derivative of the reward with respect to the model parameters, which in the case of a neural network is obtained via backpropagation. The derivative of the Maximum Entropy objective with respect to the reward equals the difference between the state visitation counts given by the expert demonstrations, $\mu_D$, and the expected visitation counts under the learned system's trajectory distribution, given in Equation 12

$$\mathbb{E}[\mu] = \sum_{\varsigma : \{s,a\} \in \varsigma} P(\varsigma \mid r). \qquad (12)$$
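The split of the data-term gradient into a visitation-count difference and a backpropagated reward derivative (Equations 10 and 11) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the network is a hypothetical single-hidden-layer model with tanh units applied independently to each state's feature vector, and all numerical values are made up.

```python
import math

def reward(f, W1, w2):
    """Tiny reward network r = w2 . tanh(W1 f) for one state's features.

    Returns the reward and the hidden activations (needed for backprop).
    """
    h = [math.tanh(sum(w * x for w, x in zip(row, f))) for row in W1]
    return sum(a * b for a, b in zip(w2, h)), h

def data_term_gradient(features, mu_expert, mu_expected, W1, w2):
    """Accumulate (mu_D - E[mu]) * dr/dtheta over all states."""
    grad_W1 = [[0.0] * len(row) for row in W1]
    grad_w2 = [0.0] * len(w2)
    for f, mu_d, mu_e in zip(features, mu_expert, mu_expected):
        _, h = reward(f, W1, w2)
        delta = mu_d - mu_e                      # dL_D/dr for this state
        for j, h_j in enumerate(h):
            grad_w2[j] += delta * h_j            # dr/dw2_j = h_j
            back = delta * w2[j] * (1.0 - h_j * h_j)  # through tanh
            for k, x in enumerate(f):
                grad_W1[j][k] += back * x
    return grad_W1, grad_w2

# Hypothetical toy values: three states with two features each.
features = [[1.0, 0.5], [0.2, -0.3], [0.0, 1.0]]
W1 = [[0.1, -0.2], [0.3, 0.4]]
w2 = [0.5, -0.7]
mu_expert = [0.6, 0.3, 0.1]
mu_expected = [0.2, 0.5, 0.3]
grad_W1, grad_w2 = data_term_gradient(features, mu_expert, mu_expected, W1, w2)
```

With a single linear output and no hidden layer, the same accumulation reduces to the difference in feature counts of the linear Maximum Entropy formulation.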
Computation of $\mathbb{E}[\mu]$ usually involves summation over exponentially many possible trajectories. A more effective algorithm based on dynamic programming, which computes this quantity in polynomial time, can be found in (Ziebart et al., 2008; Kitani et al., 2012). Subsequently, the effective computation of the gradient of the data term involves first computing the difference in visitation counts using this algorithm and then passing this as an error signal through the network using backpropagation.

The presented algorithm is applied to train FCNNs based on the loss derivatives for all states at once. As each of the final state-wise rewards is influenced by its corresponding area in the original state space, its receptive field, training with the summed loss over the whole scene is equivalent to a stochastic gradient formulation with all receptive fields addressed in a minibatch. This formulation is computationally more efficient than separate computation per field, since these fields overlap as soon as the width of our convolutional filters exceeds one (Long et al., 2014).

Algorithm 1 Maximum Entropy Deep IRL

Input: $\mu_D^a$, $f$, $S$, $A$, $T$, $\gamma$
Output: optimal weights $\theta^*$

1: $\theta^1$ = initialise_weights()
   Iterative model refinement
2: for $n = 1 : N$ do
3:    $r^n$ = nn_forward($f$, $\theta^n$)
   Solution of MDP with current reward
4:    $\pi^n$ = approx_value_iteration($r^n$, $S$, $A$, $T$, $\gamma$)
5:    $\mathbb{E}[\mu^n]$ = propagate_policy($\pi^n$, $S$, $A$, $T$)
   Determine Maximum Entropy loss and gradients
6:    $\mathcal{L}_D^n = \log(\pi^n) \times \mu_D^a$
7:    $\partial \mathcal{L}_D^n / \partial r^n = \mu_D - \mathbb{E}[\mu^n]$
   Compute network gradients
8:    $\partial \mathcal{L}_D^n / \partial \theta^n$ = nn_backprop($f$, $\theta^n$, $\partial \mathcal{L}_D^n / \partial r^n$)
9:    $\theta^{n+1}$ = update_weights($\theta^n$, $\partial \mathcal{L}_D^n / \partial \theta^n$)
10: end for

Algorithm 2 Approximate Value Iteration

1: $V(s) = -\infty$
2: repeat
3:    $V_t = V$; $V(s_{goal}) = 0$
4:    $Q(s, a) = r(s, a) + \mathbb{E}_{T(s,a,s')}[V(s')]$
5:    $V = \mathrm{softmax}_a\, Q(s, a)$
6: until $\max_s (V(s) - V_t(s)) < \epsilon$
7: $\pi(a \mid s) = e^{Q(s,a) - V(s)}$

Algorithm 3 Policy Propagation

1: $\mathbb{E}_1[\mu(s_{start})] = 1$
2: for $i = 1 : N$ do
3:    $\mathbb{E}_i[\mu(s_{goal})] = 0$
4:    $\mathbb{E}_{i+1}[\mu(s)] = \sum_{s',a} T(s', a, s)\, \pi(a \mid s')\, \mathbb{E}_i[\mu(s')]$
5: end for
6: $\mathbb{E}[\mu(s)] = \sum_i \mathbb{E}_i[\mu(s)]$

The complete proposed method is described by Algorithm 1, with the loss and gradient derivation in lines 6 and 7 given by the linear Maximum Entropy formulation. The expert's state-action frequencies $\mu_D^a$, which are needed for the calculation of the loss, are summed over the actions to compute the expert state frequencies $\mu_D = \sum_{a=1}^{A} \mu_D^a$.

Lines 4 and 5 are explained in detail in Algorithms 2 and 3 respectively, and are adapted from (Kitani et al., 2012). Algorithm 2 determines the policy given the current reward model via iterative update of the state-action value function, while Algorithm 3 determines the expected state visiting frequencies by probabilistically traversing the MDP given the current policy. Additional indices representing the iteration of the main algorithm were omitted in these subscripts in favour of readability.
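The two subroutines of Algorithms 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not in the original text: rewards are given per state rather than per state-action pair, the transition model is a nested list `T[s][a]` of `{successor: probability}` dictionaries, the goal value is pinned to zero in every sweep, and the example is a hypothetical four-state chain with a reward of -1 everywhere.

```python
import math

NEG_INF = float("-inf")

def logsumexp(xs):
    """Soft maximum over a list; returns -inf if every entry is -inf."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def approx_value_iteration(r, T, goal, eps=1e-8, max_iter=10000):
    """Soft value iteration in the spirit of Algorithm 2."""
    n = len(r)
    V = [NEG_INF] * n
    V[goal] = 0.0
    for _ in range(max_iter):
        Q = [[r[s] + sum(p * V[s2] for s2, p in T[s][a].items())
              for a in range(len(T[s]))] for s in range(n)]
        V_new = [logsumexp(q) for q in Q]
        V_new[goal] = 0.0
        delta = max(0.0 if a == b else abs(a - b) for a, b in zip(V_new, V))
        V = V_new
        if delta < eps:
            break
    # pi(a|s) = exp(Q(s,a) - soft-max_a Q(s,a)); each row sums to one
    policy = [[math.exp(q - logsumexp(qs)) for q in qs] for qs in Q]
    return policy, V

def propagate_policy(policy, T, start, goal, horizon):
    """Expected state visitation counts in the spirit of Algorithm 3."""
    n = len(policy)
    E = [0.0] * n
    E[start] = 1.0
    total = [0.0] * n
    for _ in range(horizon):
        E[goal] = 0.0                      # the goal state absorbs probability mass
        total = [t + e for t, e in zip(total, E)]
        nxt = [0.0] * n
        for s_prev in range(n):
            for a, p_a in enumerate(policy[s_prev]):
                for s, p in T[s_prev][a].items():
                    nxt[s] += p * p_a * E[s_prev]
        E = nxt
    return total

# Hypothetical 4-state chain: action 0 moves left, action 1 moves right.
n_states = 4
T = [[{max(s - 1, 0): 1.0}, {min(s + 1, n_states - 1): 1.0}]
     for s in range(n_states)]
r = [-1.0] * n_states
policy, V = approx_value_iteration(r, T, goal=3)
visits = propagate_policy(policy, T, start=0, goal=3, horizon=50)
```

The resulting `visits` vector plays the role of $\mathbb{E}[\mu]$ in line 7 of Algorithm 1; the policy is stochastic, so detours away from the goal keep a small but non-zero probability.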


4. Experiments 


We assess the performance of DeepIRL on two benchmark tasks against current state-of-the-art approaches: GPIRL (Levine et al., 2011), NPB-FIRL (Choi & Kim, 2013) and the original MaxEnt (Ziebart et al., 2008), the latter to illustrate the necessity of non-linear function approximation.

All tests are run multiple times on training and transfer scenarios for the different settings. Learning is performed on synthetically generated stochastic demonstrations derived from the optimal policy in order to evaluate performance on suboptimal example sets. This is achieved by providing a number of demonstrations sampled from the optimal policy based on the true reward structure, but including 30% of random actions.
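One way to generate such noisy demonstrations is sketched below. The interpretation that each step is independently replaced by a uniformly random action with probability 0.3 is an assumption, as are the function names, the deterministic `step` callback and the five-state chain used as a toy environment.

```python
import random

def sample_demonstration(policy, step, start, goal, n_actions,
                         p_random=0.3, max_len=100, rng=None):
    """Roll out `policy` from `start`, swapping in random actions with prob. p_random.

    `policy[s]` is the optimal action in state s and `step(s, a)` returns the
    successor state (hypothetical deterministic environment interface).
    """
    rng = rng or random.Random()
    state, trajectory = start, []
    while state != goal and len(trajectory) < max_len:
        if rng.random() < p_random:
            action = rng.randrange(n_actions)   # suboptimal, uniformly random
        else:
            action = policy[state]
        trajectory.append((state, action))
        state = step(state, action)
    return trajectory

def chain_step(state, action, n_states=5):
    """Toy dynamics: action 0 moves left, action 1 moves right."""
    return max(state - 1, 0) if action == 0 else min(state + 1, n_states - 1)

optimal_policy = [1, 1, 1, 1, 1]  # always move right towards the goal
noisy_demo = sample_demonstration(optimal_policy, chain_step, start=0, goal=4,
                                  n_actions=2, rng=random.Random(0))
clean_demo = sample_demonstration(optimal_policy, chain_step, start=0, goal=4,
                                  n_actions=2, p_random=0.0)
```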


In our experiments, we employ an FCNN with two hidden layers and rectified linear units as function approximator between state feature representation and reward. This rather shallow network structure suffices for the application based on strongly simplified toy benchmarks. However, the whole framework can be utilised for training networks of arbitrary capacity. All experiments except for the spatial feature learning in Section 4.3 are based on filters of width one to focus on direct evaluation against the other algorithms, which are in their current form limited to the features of each state for reward approximation. Wider filters as applied for spatial feature learning are used to evaluate the performance on raw inputs without manual feature design. For these benchmarks, we apply AdaGrad (Duchi et al., 2011), an approach for stochastic gradient descent with per-parameter adaptive learning rates. Significant parts of the neural network implementation are based on MatConvNet (Vedaldi & Lenc, 2014).


In line with related works, we use expected value difference as the principal metric of evaluation. It is a measure of the suboptimality of the learned policy under the true reward. The score represents the difference between the value function obtained for the optimal policy given the true reward structure and the value function obtained for the optimal policy based on the learned reward model. In addition to the evaluation on each specific training scenario, the trained models are evaluated on a number of randomly generated test environments. The test on these transfer examples serves to analyse each algorithm's ability to generalise to the true reward structure without over-fitting.
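The metric can be sketched on a toy problem as follows. Several details are assumptions made for illustration only: the use of deterministic greedy policies obtained by standard discounted value iteration, the discount factor of 0.9, averaging the difference uniformly over states, and the three-state chain with made-up rewards.

```python
def q_value(s, a, r, T, V, gamma):
    """One-step lookahead with state rewards r[s] and transitions T[s][a]."""
    return r[s] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())

def greedy_policy(r, T, gamma=0.9, iters=500):
    """Optimal deterministic policy for reward r via value iteration."""
    n = len(r)
    V = [0.0] * n
    for _ in range(iters):
        V = [max(q_value(s, a, r, T, V, gamma) for a in range(len(T[s])))
             for s in range(n)]
    return [max(range(len(T[s])), key=lambda a: q_value(s, a, r, T, V, gamma))
            for s in range(n)]

def evaluate_policy(policy, r, T, gamma=0.9, iters=500):
    """Value of a fixed deterministic policy under reward r."""
    V = [0.0] * len(r)
    for _ in range(iters):
        V = [q_value(s, policy[s], r, T, V, gamma) for s in range(len(r))]
    return V

def expected_value_difference(r_true, r_learned, T, gamma=0.9):
    """Mean value lost, under the true reward, by acting on the learned reward."""
    v_opt = evaluate_policy(greedy_policy(r_true, T, gamma), r_true, T, gamma)
    v_learned = evaluate_policy(greedy_policy(r_learned, T, gamma), r_true, T, gamma)
    return sum(a - b for a, b in zip(v_opt, v_learned)) / len(r_true)

def chain(n):
    """Toy chain MDP: action 0 moves left, action 1 moves right."""
    return [[{max(s - 1, 0): 1.0}, {min(s + 1, n - 1): 1.0}] for s in range(n)]

T = chain(3)
r_true = [0.0, 0.0, 1.0]
evd_rescaled = expected_value_difference(r_true, [0.0, 0.0, 2.0], T)
evd_reversed = expected_value_difference(r_true, [1.0, 0.0, 0.0], T)
```

A rescaled reward induces the same policy and therefore no value difference, whereas a reward that sends the agent the wrong way is penalised, which is why the metric compares policies rather than reward values directly.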


4.1. Objectworld Benchmark 


The Objectworld scenario (Levine et al., 2011) consists of a map of $M \times M$ states for $M = 32$ where possible actions include motions in all four directions as well as staying in place. Two different sets of state features are implemented based on randomly placed colours to evaluate the algorithms. For the continuous set $x \in \mathbb{R}^C$, each feature dimension describes the minimum distance to an object of one of $C$ colours. Building on the continuous representation, the discrete set includes $C \times M$ binary features, where each dimension indicates whether an object of a given colour is closer than a threshold $d \in \{1, \ldots, M\}$.

The reward is positive for cells which are both within distance 3 of colour 1 and distance 2 of colour 2, negative if only within distance 3 of colour 1, and zero otherwise. This is illustrated for a small subset of the state space in Figure 3.

Figure 3: Objectworld benchmark. The true reward is displayed by the brightness of each cell and based on the surrounding object configuration. Only a subset of colours influences the reward, while the others serve as distracting features. Legend: optimal policy; example reward-building objects; example distractor objects; reward (low to high).

In line with common benchmarking procedures, we evaluated the algorithms with a set number of features and an increasing number of demonstrations. Additionally, the learned reward functions are deployed on randomly generated transfer scenarios to uncover any overfitting to the training data.

While the original MaxEnt is unable to capture the nonlinear reward structure well, both DeepIRL and GPIRL provide significantly better approximations, as represented in Figure 4. Evaluation of NPB-FIRL on this benchmark was done in (Choi & Kim, 2013), where it showed a similar level of performance as GPIRL. GPIRL generates a good model already with few data points, whereas DeepIRL achieves commensurate performance when increasing the number of available expert demonstrations. The same behaviour is exhibited when using both continuous and discrete state features (Fig. 5). The requirement for more training data is rendered unimportant in robot applications based on autonomous data acquisition, where the lower algorithmic complexity becomes the dominant advantage of the parametric approach.

Figure 4: Reward reconstruction sample in the Objectworld benchmark provided $N = 64$ examples and $C = 2$ colours with continuous features. White - high reward; black - low reward.


Additional tests are performed with an increased number of distractor features to evaluate each approach's overfitting tendency. The corresponding figures are left out due to limited space. Both DeepIRL and GPIRL show robustness to distractor variables, though DeepIRL shows slightly stronger signs of overfitting as the number of distractor variables is increased. This is due to the NN's capacity being brought to bear on the increasing noise introduced by the distractors and will be addressed in future work with additional regularisation methods, such as Dropout (Hinton et al., 2012) and ensemble methods.






Figure 5: Objectworld benchmark. From top left to bottom right: expected value difference (EVD) with $C = 2$ colours and varying number of demonstrations $N$ for a) training and b) transfer with continuous features, and subsequently with discrete features in c) and d). As the number of demonstrations grows, DeepIRL is able to quickly match the performance of GPIRL on the task.


4.2. Binaryworld Benchmark 

In order to test the ability of all approaches to successfully approximate more complex reward structures, the Binaryworld benchmark is presented. This test scenario is similar to Objectworld, but in this problem every state is randomly assigned one of two colours (blue or red). The feature vector for each state consequently consists of a binary vector of length 9, encoding the colour of each cell in its 3x3 neighbourhood. The true reward structure for a particular state is fully determined by the number of blue states in its local neighbourhood. It is positive if exactly four out of nine neighbouring states are blue, negative if exactly five are blue, and zero otherwise. The main difference compared to the Objectworld scenario is that a single feature value does not carry much weight; rather, higher-order relationships amongst the features determine the reward.

Figure 6: Value differences observed in the Binaryworld benchmark for GPIRL, MaxEnt and DeepIRL for the training scenario (left) and the transfer task (right).

Since the reward depends on a higher-level representation of the basic features, namely the number of features with a specific value, this case is arguably more challenging than the original Objectworld experiment, and good performance on this benchmark implies the algorithm's ability to learn and capture this complex relationship.
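The Binaryworld features and reward described above can be sketched as follows. The reward magnitudes of +1 and -1 and the wrap-around treatment of cells at the grid border are assumptions, since the text only specifies the sign of the reward; the grid size and random seed are arbitrary.

```python
import random

def binaryworld_features(grid, i, j):
    """Length-9 binary vector with the colours (1 = blue, 0 = red) of the 3x3 neighbourhood.

    Border cells wrap around the grid (assumed boundary handling).
    """
    m = len(grid)
    return [grid[(i + di) % m][(j + dj) % m]
            for di in (-1, 0, 1) for dj in (-1, 0, 1)]

def binaryworld_reward(features):
    """Positive for exactly four blue cells, negative for exactly five, else zero."""
    n_blue = sum(features)
    if n_blue == 4:
        return 1.0   # assumed magnitude
    if n_blue == 5:
        return -1.0  # assumed magnitude
    return 0.0

rng = random.Random(0)
size = 8
grid = [[rng.randint(0, 1) for _ in range(size)] for _ in range(size)]
reward_map = [[binaryworld_reward(binaryworld_features(grid, i, j))
               for j in range(size)] for i in range(size)]
```

Flipping a single cell can move a state between the positive, negative and zero cases, so no individual feature is informative on its own; only the count over all nine matters.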

The performance of DeepIRL compared to GPIRL, linear MaxEnt and NPB-FIRL is depicted in Fig. 6. In this more complex scenario, DeepIRL is able to learn the higher-order dependencies between features, whereas GPIRL struggles as the inherent kernel measure cannot correctly relate the reward of different examples with similarity in their state features. GPIRL needs a larger number of demonstrations to achieve good performance and to determine an accurate estimate of the reward for all $2^9$ possible feature combinations.

Perhaps surprising is the comparatively low performance of the NPB-FIRL algorithm. This can be explained by the limitations of this framework. The true reward in this scenario cannot be efficiently described by the logical conjunctions used. In fact, it would require $2^9$ different logical conjunctions, each capturing all possible combinations of features, to accurately model the reward in this framework.

Fig. 7 shows the reconstruction of the reward structures estimated by DeepIRL, MaxEnt and GPIRL. While GPIRL was able to reconstruct the correct reward for some of the states having features it has encountered before, it provides inaccurate rewards for states which were never encountered. It produces an overall too smooth reward function due to assumptions and priors in the GP approximation. On the other hand, DeepIRL is able to reconstruct it with high accuracy, demonstrating the ability to effectively learn the highly-varying structure of the underlying function.

Figure 7: Reward reconstruction sample for the Binaryworld benchmark provided $N = 128$ demonstrations. Panels: ground truth, GPIRL, DeepIRL, MaxEnt. White - high reward; black - low reward.

Figure 8: Application of convolutional layers for spatial feature learning. Panels: Objectworld, Binaryworld. Spatial feature learning quickly converges to performance with optimally designed features.

4.3. Spatial Feature Learning 

While the earlier benchmarks visualise performance compared
to current algorithms in the context of precomputed
features, the approach can be extended via the use of wider
filters to eliminate the requirement of preprocessing or
manual design of features. Figure 8 represents the results
for both earlier benchmarks, but instead of using the earlier
described feature representations, the FCNN builds the reward
based on the raw input representation, which for each
state only includes the availability of each specific object at
that specific state. All spatial information is derived from
the convolutional filters. Given the simplicity of the
benchmarks, we employed a five-layer approach with 3x3
convolutional kernels in the first two layers. By increasing
the depth of the network and including convolutional filters,
we add enough capacity to enable the learning of features
as well as their combination into the reward function in the
same architecture and process.
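The architecture described above can be illustrated with a minimal forward-pass sketch in plain Python. It is not the authors' implementation: the hidden width, the ReLU nonlinearity, zero padding, the random initialisation and all function names (`encode_grid`, `build_fcnn`, `reward_map`) are assumptions made for illustration. Only the layer count (five) and the 3x3 kernels in the first two layers are taken from the text; the remaining layers are assumed to have width one.

```python
import random

def encode_grid(grid, n_objects):
    """Raw input: one binary channel per object type (None = empty cell)."""
    return [[[1.0 if cell == c else 0.0 for cell in row] for row in grid]
            for c in range(n_objects)]

def conv2d_same(x, w, b):
    """Stride-1 convolution with zero padding.
    x: [C_in][H][W], w: [C_out][C_in][k][k], b: [C_out]."""
    h, wd, k = len(x[0]), len(x[0][0]), len(w[0][0])
    pad = k // 2
    out = []
    for o in range(len(w)):
        plane = [[b[o]] * wd for _ in range(h)]
        for c in range(len(x)):
            for i in range(h):
                for j in range(wd):
                    acc = 0.0
                    for di in range(k):
                        for dj in range(k):
                            ii, jj = i + di - pad, j + dj - pad
                            if 0 <= ii < h and 0 <= jj < wd:
                                acc += w[o][c][di][dj] * x[c][ii][jj]
                    plane[i][j] += acc
        out.append(plane)
    return out

def relu(x):
    return [[[max(0.0, v) for v in row] for row in plane] for plane in x]

def init_layer(rng, c_out, c_in, k):
    """Random Gaussian weights (assumed initialisation), zero biases."""
    scale = (2.0 / (c_in * k * k)) ** 0.5
    w = [[[[rng.gauss(0.0, scale) for _ in range(k)] for _ in range(k)]
          for _ in range(c_in)] for _ in range(c_out)]
    return w, [0.0] * c_out

def build_fcnn(n_objects, hidden=4, seed=0):
    """Five layers: 3x3 kernels in the first two, width-one kernels after."""
    rng = random.Random(seed)
    kernels = [3, 3, 1, 1, 1]
    channels = [n_objects] + [hidden] * 4 + [1]
    return [init_layer(rng, channels[i + 1], channels[i], k)
            for i, k in enumerate(kernels)]

def reward_map(layers, x):
    """Apply all layers; the last layer is linear and yields one reward per state."""
    for idx, (w, b) in enumerate(layers):
        x = conv2d_same(x, w, b)
        if idx < len(layers) - 1:
            x = relu(x)
    return x[0]

grid = [[None, 0, None, None],
        [1, None, None, 0],
        [None, None, 1, None]]
rewards = reward_map(build_fcnn(n_objects=2), encode_grid(grid, 2))
```

With two stacked 3x3 layers, each state's reward in this sketch depends only on the 5x5 neighbourhood of raw inputs around it, which is how spatial context enters without precomputed features.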

Due to the increasing number of parameters, the approach
requires additional training data to perform at equal accuracy,
but with an increasing number of expert samples it converges
towards the performance with predefined features.
Since the given features in these simplified toy problems
are optimal and the true reward is directly calculated on
their basis, automatically learned features cannot exceed
this performance. However, in real-world scenarios, the
compression of raw data - such as images - to feature
representations leads to information loss and the learning of
task-relevant features gains even more importance.


5. Conclusion and Future Work 


This paper presents Maximum Entropy Deep IRL, a framework
exploiting FCNNs for reward structure approximation
in Inverse Reinforcement Learning. Neural networks
lend themselves naturally to this task as they combine
representational power with computational efficiency
compared to state-of-the-art methods. Unlike prior art in this
domain, DeepIRL can therefore be applied in cases where
complex reward structures need to be modelled for large
state spaces. Moreover, training can be achieved effectively
and efficiently within the popular Maximum Entropy IRL
framework. A further advantage of DeepIRL lies in its
versatility. Custom network architectures and types can be
developed for any given task while exploiting the same cost
function in training. This is expressed in Section 4.3, where
convolutional filters are applied to eliminate the need for
manual feature design.
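As a reminder of why the same cost function carries over to any differentiable architecture, the data term of the Maximum Entropy objective can be differentiated by the chain rule. The notation below is assumed for illustration: r = g(f, θ) is the network output for features f and parameters θ, μ_D denotes the expert state visitation frequencies, and E[μ] the visitation frequencies expected under the current reward.

```latex
\frac{\partial \mathcal{L}_D}{\partial \theta}
  = \frac{\partial \mathcal{L}_D}{\partial r} \cdot \frac{\partial r}{\partial \theta}
  = \left( \mu_D - \mathbb{E}[\mu] \right) \cdot \frac{\partial}{\partial \theta} g(f, \theta)
```

Only the second factor depends on the network, so exchanging the architecture changes the backpropagation step but not the error signal fed into it.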


Our experiments show that DeepIRL’s performance is
commensurate to the state-of-the-art on a common benchmark.
While exhibiting slightly increased requirements regarding
training data in this benchmark, a principal strength
of the approach lies in its algorithmic complexity being
independent of the number of demonstration samples. Therefore,
it is particularly well-suited for life-long learning scenarios
in the context of robotics, which inherently provide
sufficient amounts of training data. We also provide an
alternative evaluation on a new benchmark with a significantly
more complex reward structure, where DeepIRL
significantly outperforms the current state-of-the-art and proves
its strong capability in modelling the interaction between
features. Furthermore, we extend the approach to wider
filters in order to eliminate the dependency on precomputed
features and to emphasise the adaptability of framing IRL
in the context of deep learning.
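One way to picture the independence from the number of demonstrations is that, under the assumption that the demonstrations enter training only through their state visitation frequencies, they can be summarised once into a vector with one entry per state. The sketch below is illustrative only: the function names and the normalisation by the number of trajectories are assumptions, and the expected visitation frequencies are taken as given rather than computed.

```python
def expert_svf(demonstrations, n_states):
    """Summarise trajectories (lists of state indices) as average visit counts.
    Normalising by the number of trajectories is an assumed convention."""
    mu = [0.0] * n_states
    for trajectory in demonstrations:
        for state in trajectory:
            mu[state] += 1.0
    n = len(demonstrations)
    return [count / n for count in mu]

def reward_gradient(mu_expert, mu_expected):
    """Per-state error signal: expert minus expected visitation frequencies."""
    return [d - e for d, e in zip(mu_expert, mu_expected)]

demos = [[0, 1, 2], [0, 1, 1]]
mu_d = expert_svf(demos, n_states=4)
signal = reward_gradient(mu_d, [1.0, 1.0, 0.5, 0.5])
```

The summary has the same size for two demonstrations or two thousand, so the cost of each subsequent training step in this sketch depends on the size of the state space and the network, not on the amount of demonstration data.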

In future work we will explore the benefits of autoencoder-style
pretraining to reduce the increased demand for expert
demonstrations when employing wider convolutional filters.
Especially when based on more complex inputs such
as raw image data, the easily available unsupervised training
data will help to learn features which then only need to
be refined during the supervised IRL-based training phase.
Due to the variety of existing work on FCNN architectures
mentioned in Section 2, we expect to be able to benefit from
applying more complex networks to real-life problems,
such as the skipping architecture by (Long et al., 2014),
which enables the concatenation of fine structural information
alongside coarser higher-level features in the last
regression layer to improve overall performance in evaluating
features of multiple scales. Furthermore, other methods
for optimising demonstration data likelihood, such as that given
by (Babes et al., 2011), will be evaluated.